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Abstract. Hidden Markov models can describe time series arising in various fields of science, by 
treating the data as noisy measurements of an arbitrarily complex Markov process. Sequential Monte 
Carlo (SMC) methods have become standard tools to estimate the hidden Markov process given the 
observations and a fixed parameter value. We review some of the recent developments allowing the 
inclusion of parameter uncertainty as well as model uncertainty. The shortcomings of the currently 
available methodology are emphasised from an algorithmic complexity perspective. The statistical 
objects of interest for time series analysis are illustrated on a toy “Lotka-Volterra” model used in 
population ecology. Some open challenges are discussed regarding the scalability of the reviewed 
methodology to longer time series, higher-dimensional state spaces and more flexible models. 

Resume. Les modeles a chaine de Markov cachee permettent de decrire les series temporelles de 
divers domaines scientifiques, en traitant les donnees comme des mesures bruitees d’un processus de 
Markov arbitrairement complexe. Les methodes de Monte Carlo sequentielles (SMC) sont devenues 
des outils standards pour I’estimation du processus de Markov cache sachant les observations et une 
valeur fixee du parametre. Nous passons en revue quelques unes des recentes avancees permettant 
de prendre en compte I’incertitude sur le parametre ainsi que I’incertitude sur le modele. Les limites 
de la methodologie actuelle sont discutees sous Tangle de la complexite algorithmique. Les objets 
statistiques d’interet pour Tanalyse des series temporelles sont illustres sur un modele jouet de type 
“Lotka-Volterra” utilise en ecologie des populations. Quelques questions ouvertes sont finalement 
posees concernant Textension de la methodologie presentee pour traiter des series de donnees plus 
longues, des espaces d’etats de dimension plus grande et des modeles plus flexibles. 


1. Setting 


1.1. Hidden Markov models 

Hidden Markov models constitute a very flexible class of models for time series data. Consider time series 
made of real-valued vectors yt GY C for a countable collection of times t G {ti,..., ...} and some integer 

dy. For simplicity we consider integer-valued times. Hidden Markov models propose to treat the observations 
(yt)ieN as if they were arising from noisy measurements of a latent Markov chain (a;t)tgN. First, the distribution 
of a Markov chain {xt)t(£N living in X C , for some integer d^, is specified through the distribution fi{dxo) of its 
initial state and the conditional distribution of each successive state f{dxt \ Xt-i), which is called the transition 


^ Department of Statistics, University of Oxford, pierre.jacob@stats.ox.ac.uk. 


© EDP Sciences, SMAI 2015 



2 


ESAIM: PROCEEDINGS AND SURVEYS 



Figure 1 . Graphical representation of the variables defined by a hidden Markov model A4. 



Figure 2. A time series of 365 observations generated according to the phytoplankton- 


zooplankton model described in Section 1.3 The observations represent daily measurements of 


phytoplankton concentrations in a volume of water. 


distribution. Then the model specifies the distribution g{dyt | x*) of each observation given the current hidden 
state, which is called the measurement distribution. Finally all those distributions are parametrized by a vector 
9 living in a set 0 C for some integer dg. We explicitly write the parameter 6 in g{dxo \ 9), f{dxt \ Xt-i,9) 
and g{dyt \ Xt,9), and a model M refers to the collection of objects {0, y, f, g}. In the following, for a sequence 
(ptjtgN (resp. and 0 < s < t, the vector {vs, ... ,Ut) (resp. (u®, ... ,u*)) is denoted by Vg-.t (resp. u®'*). 

Hidden Markov models are often represented by dependency graphs such as Figure although 9 is typically 
omitted. The graph indicates that the law of yt given Xt and 9 is independent of the other variables, the law 
of Xt given Xt-i, yt, xt+i and 9 is independent of the other variables, and so on. Figure shows a time series 
generated by a hidden Markov model to be specified in Section |1.3[ 

After having obtained some data, the user comes up with one or several candidate models, denoted by 
..., for some integer M. Hidden Markov models are routinely applied to the modeling of volatility 

of financial assets, of meteorological time series such as wind speed and direction, or of population growth in 
ecology (see [Ullini for a variety of applications). The design of a model preferably takes into account as much 
knowledge as possible on the phenomenon under study, and a particular model developed for one phenomenon 
is unlikely to be of use for others. There are exceptions: for instance, stochastic volatility models have been 
applied to pollution data in [T]. 

In the rest of the article we assume that a collection of models is given to us. Then the following questions 
naturally arise. 
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• For each model, how do the data inform about the parameters? 

• For each model, how do the data inform about the latent Markov process? 

• How do the data inform about the choice of a model? 

• How to predict future observations? 

If one can simulate synthetic datasets from a model given any parameter value, then by trial and error, one can 
gather some intuition on how models perform. Intuitively, a model that generates synthetic time series that 
resemble actual data, at least for some parameter value, is believed to be a good model, for instance in the 
sense that one would hope its prediction of future observations to be reliable. 

Statistical inference is concerned with formalizing and automatizing this ad hoc procedure, by making a 
principled connection between data and models. In Section [l.2| the Bayesian framework is shown to transform 
the above questions into various integrals. In Section |l.3| we introduce the notion of implicit model, on which 
we focus thereafter. In Section [L4| we state some desired properties of statistical computing methods for time 
series models. Section reviews some numerical methods compatible with implicit models, called “plug and 
play” methods, and discusses whether they meet the desired requirements. In particular, sequential Monte 
Carlo (SMC) and particle Markov chain Monte Carlo methods are introduced. In Section we describe a 
recently proposed method called SMC^ to perform Bayesian inference sequentially and mention its shortcomings. 
Section illustrates the method on a population growth model. Finally Section discusses the reviewed 
methodology and some open challenges. 


1.2. Objects of inference 

To begin with, the model At and a particular parameter value 9 are considered fixed, and the goal is to 
estimate the distribution of the hidden process given observations. Filtering refers to the task of estimating 
at time t the current state Xt given the available observations yo:t- The filtering distribution p{dxt \ yo-.t,9) is 
obtained using Bayes rule as 


J ip(xt) p{dxt I yo:t,9) = ^ J ip{xt) ^p{dxo \ 9) ^ /{dx^ \ giy^ \ Xs, 9) 


law of the latent Markov chain conditional law of the observations 


( 1 ) 

for any test function p. In the rest of the article, p refers to a generic test function, defined either on X, Y or 
0. The normalizing constant p{yo:t \ 9) in Eq. 0 is called the marginal likelihood of 9 and we come back to 
it in Eq. Q. Filtering sometimes refers to the distribution p{dxo.t \ yo-.t,&) of the path (or “trajectory”) xq,* 
given yo:f Prediction refers to the inference of both a future observation yt+k and a future state xt+k given 
the current observations yo-.t, for some fc > I. Whether the interest lies on future states or future observations 
depends on the application. Using the Markov property, prediction is obtained as a by-product of the filtering 
distribution. Denoting the state predictive distribution by p{dxt+k \ yo:t,9), it can be written as 


pixt+k) p{dxt+k I yo:t,9) = / p{xt+k)p{dxt I yo-.t,9) 

7x''+i ^- V-^ 


t-\-k 


n 


s-l) ^) 


filtering law 




conditional law of the future Markov chain 


( 2 ) 


The predictive distribution of yt+k given yo-t, denoted by p{dyt+k \ y+.t, 9), is then derived as 


‘^{yt+k) p{dyt+k I yo-.t,9) 



v{yt+k) p{dxt+k I yo-.t,0) g{dyt+k \ Xt+k,9). 

' -V-" '-V-' 

state prediction measurement 


( 3 ) 


Finally smoothing refers to inference of a past state Xt-k given for some I < fc < t. 
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In many realistic situations one does not know which parameter value 0 to plug in the model, even if the 
interest lies in filtering the hidden process; in other settings parameters are themselves the objects of interest. To 
learn about the parameters from the observations, a “prior” probability distribution tt^ is given to the parameter 
9 and the goal is to estimate the “posterior” distribution given the observations. The marginal likelihood of 9 
is the probability density function of the distribution of the data given 9, evaluated at the observations yo:t' 

» t t 

p{yo:t \0)= n{dxo I 9) TT f{dxs I Xs-i, 9)\\g{ys \ Xs, 9). (4) 

s=l .=0 

The marginal likelihood of 9 was the normalizing constant in Eq. Q; the “full likelihood” sometimes refers to 
the joint probability density function of xg-.t and yo-.t given 9. Then the posterior distribution ire^t of 9 given yg-t 
is defined by Bayes rule as 


f (p{9)TTe,t{d9) = - — -—— f Lp{9)p{yg.,t \ 9)TTg{d9). (5) 

Je j0P{yo:t\0)Tre{d9) Jq 

The normalizing constant in Eq. ([^ will be useful for model comparison, and we come back to it in Eq. Q. 
By taking parameter uncertainty into account, one can redefine the tasks of filtering, prediction and smoothing. 
For instance, filtering under parameter uncertainty refers to the distribution p(dxt \ yo-.t) of the current hidden 
state Xt given yg-.t, averaged over all possible parameters: 


J <p{xt) p{dxt I yg,t) 



p{dxt \ yg,t,9) T^eAd&) 



filtering given parameter posterior on parameter 


( 6 ) 


Likewise one can be interested in filtering over the full paths, prediction and smoothing under parameter 
uncertainty. 

Once parameter uncertainty is taken into account, the next source of uncertainty is at the model level. When 
several models are available, there are various ways to use the observations to compare models 

(see Chapter 7 of m, Chapter 6 of m)- A building block of model comparison is the “model evidence”. The 
evidence of is defined as the normalizing constant of its posterior distribution, that is, denoting the 

parameter 0^"*^ and its space as 0^"*^ 

= f p{yo-.t I 9)-Kg(^){d9) =p{yg:t I (7) 

70{-) 

The model evidence, which was the normalizing constant in Eq. ([^, can be understood as the density of the 
observations yg-.t given the model By introducing prior probabilities on the discrete set of model labels 

one can then consider posterior probabilities of the models given the data, obtained again 

by Bayes rule as 


=M(-) I yo,,) 


P (M = M(™)) 

E™'=i ' 


( 8 ) 


On top of parameter uncertainty, model uncertainty can be taken into account when performing filtering, 
prediction or smoothing. For instance the predictive distribution V{dyt+k \ yo-.t) of new observations yt+k given 
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yo:t under both model and parameter uncertainty can be written as 


<p{yt+k) Vidyt+k I yo:t) 


M „ . 

X! / / y^iyt+k) pidyt+k\yo:t,0,M^""'^) Trgi^) t{d0) P | i/o:t) • 


(9) 


predictive distribution of y posterior on parameter 


posterior on model 


This type of quantity is also referred to as model averaging [^. Sometimes the task consists in selecting one 
model among the M proposed ones. Then a standard procedure is to estimate the posterior odds of model 
versus model 

P (A1 = X W I yo,t) _ p {yo-.t I ,, P (X = 

¥ {M = yo-.t) ~ P {yo:t \ ^ V {M = ' ' ’ 

^^ ^^ 

Bayes factor prior odds 


Typically the value 1 is assigned to the prior odds, corresponding to uniform prior probabilities on the model 
labels, and thus the posterior odds correspond to the Bayes factor [ST]. It is well-known jSSj that the Bayes 
factor embodies the principle of Occam’s razor: “simple” models (i.e. with a low-dimensional parameter space) 
are favoured over “complex” models (i.e. with a high-dimensional parameter space) until enough data have 
accrued to support the additional complexity. 

As discussed in m, it can be confusing to specify non-zero prior probabilities on the event “model M.^^) 
is the true, data-generating model”, for each m G {1,... ,M}. Often, we do not believe that any of the M 
candidate models is the true data-generating model. In that case, we can bypass the specification of probabilities 
of the models being true, and interpret the model evidence in Eq. Q as a prior predictive criterion |3Ij . 
representing how likely the observations are under the model. The logarithm of the Bayes factor is then the 
difference between expected utilities associated to each model, under the logarithmic scoring rule, for the task 
of predicting observations using the prior m- Thus, even in the setting where each candidate model is mis- 
specified compared to the data-generating process, the use of Bayes factors is still defensible. 

To summarise, there are three layers of uncertainty in hidden Markov models: the hidden process, the model 
parameters and the model itself. Ideally each of these uncertainties should be taken into account. 

1.3. Implicit models 

The integrals presented above are in general impossible to evaluate, except for linear Gaussian models. In 
these models the latent Markov chain is an autoregressive process and the observations are Gaussian measure¬ 
ments of it. Formally xq ~ A/'(/io, Sq), the Gaussian distribution with mean /tq and variance Egj for each 
1 , 


xt = Axt-i + vt and yt-i = Bxt-i + Wt-i, 

where Vt and Wt are Gaussian random variables with zero mean and variances and Ey respectively. The 
parameter is thus 9 = (/roj Sg, A, E^,, B, Ey) and is made either of real values or of vectors and matrices of 
compatible dimensions. The linearity and Gaussianity of the model equations imply that various conditional 
distributions of interest, for instance the filtering distribution of Xt given i/o:t and 0, are also Gaussian. The 
Kalman filter [5] provides the mean and variance of these Gaussian distributions for a computational cost of 
0{t). As a by-product the likelihood in Eq. @ can be evaluated for any 9 for a cost in 0{t), which allows 
parameter estimation using standard techniques, sometimes under the name of system parameter identification, 
see e.g. [12] ■ In a nutshell, linearity and Gaussianity make integration with respect to the hidden Markov chain 
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analytically possible and thus the integrals of Section 1.2 can be either evaluated or at least approximated using 
standard techniques such as Markov chain Monte Carlo. 

As discussed in m there are some non-Gaussian models for which the filtering distribution is tractable. 
However it has been argued, see e.g. HZ], that models should preferably be proposed based on scientific grounds 
rather than on computational ones. In practice scientists often come up with complex and generative models, 
and then use linearization and Gaussian assumptions only to enjoy the numerical efficiency of Kalman filters 
and related methods (e.g. Extended Kalman filter. Ensemble Kalman filter), particularly in high-dimensional 
settings [45] . Linearization and Gaussianity assumptions result in a systematic bias in the subsequent estimation, 
compared to the results that would be obtained under the original model. 

Let us have a look at a simple generative model that has been proposed to model the interaction of phyto¬ 
plankton and zooplankton in [^[^. The “PZ” model is a variation of a Lotka-Volterra model for interactions 
between prey and predator. Phytoplankton are modeled as prey on which zooplankton are grazing. Over 
successive days indexed by t, the model describes the stochastic growth rate of prey at, the population size 
of prey pt and the population size of predator zt- Thus the hidden state Xt = {at,pt,zt) is three-dimensional. 
The stochastic growth rate at follows the same distribution every day t > 0: at ^ A/'(^Q,cr^). The initial 
distributions for both species are given by 

logpo ~ A/'(/ip,crp) and log zo ~ A/'(/iz, cr^). 


The transition of {pt) and (zt) is jointly described by the differential equation 


dpt 

dt 

dzt 

dt 


= apt - cptzt, 


= ecptZt - mizt - mqZ^. 


This is to be interpreted as follows: given a value for pt-i and Zt-i, and a draw a of at, the next states pt 
and Zt are obtained as the deterministic solution of the above equation over one time unit. In the equation, c 
represents the clearance rate of the prey, e is the growth efficiency of the predator, mi and mq are the linear 
and quadratic mortality rates of the predator. Note that {pt,Zt) can also be defined at non-integer times, and 
we could consider at to be a piecewise constant continuous time process jumping at each integer time according 
to JV{p,a,o''^)- However the observations are gathered in discrete time and thus we find it more convenient to 
specify the hidden Markov chain in discrete time as well. To summarize, given Xt-i = {at-i,pt-i, Zt-i), the 
next state is obtained by drawing at from Af{pa,o''^) and {pt,zt) is the solution of the ordinary differential 
equation above. In practice, this solution can be approximated with arbitrary precision by numerical solvers 
such as Runge-Kutta methods. Note also that the difference between the PZ model and the classical Lotka- 
Volterra lies in the addition of a quadratic mortality term, and in the randomness of the growth rate at sampled 
at each integer time. 

Given the Markov process (xtjtgN, the observations are noisy measurement of the phytoplankton, \ogyt 
Milogpt, the zooplankton are not measured. Indeed it is comparatively easier to measure the concentration 
of phytoplankton in a volume of water due to the fluorescence of the chlorophyll that they contain. For simplicity 
we set c = 0.25 and e = 0.3, Pp = Pz = log 2, Up = 0.2, Uz = 0.1, and leave the remaining constants as unknown 
(or “free”) parameters. Thus we introduce 6 = {pa,aa,o'y,'rni,mq). Figure]^ was obtained by generating a 
year of data from the model, with parameters pa = 0.7, = 0.5, Uy = 0.2, mi = 0.1, = 0.1. A prior 

distribution is put on the model parameters, simply chosen to be a uniform distribution on [0,1] for each of the 
five components. For this application the interest lies both in the prediction of future states and in parameter 
inference. 

The PZ model is very standard from a scientific point of view as Lotka-Volterra equations date back to the 
1930s (see the historical introduction in Cl)- However, from a statistical point of view the model is not linear 
nor Gaussian and thus the integrals defined in Section 1.2 are impossible to evaluate exactly. The model is 


generative in the sense that trajectories Xo-.t of the hidden Markov chain can be sampled, if not exactly, at least 
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with arbitrary precision using numerical solvers of ordinary differential equations, for any parameter 6. Then a 
series of observations ?/o:t can be simulated given a path xoit and 6. When the transition distribution / is such 
that Xt can be sampled given Xt-i and 0 but its transition density cannot be evaluated, the model is said to be 
“implicit ” m Here the ability to evaluate the transition density f(xt \ Xt-i,0) for a given triplet {xt,Xt-i,0) 
would in fact depend on the numerical solver being used, but in general we do not want to assume that we 
know how to perform this computation. 

Aside from Markov chains defined by differential equations, another scenario where sampling from the tran¬ 
sition distribution is easier than evaluating its probability density function occurs when the transition involves 
latent variables. An example is the Levy driven stochastic volatility model described in [8j. Given the previ¬ 
ous state Xt-i and a parameter 9, Xt is obtained by sampling an integer-valued random variable k, and then 
k other random variables vi-k independently from some distribution p{dv \ 9). The state Xt is obtained as 
Xt = 'ip{xt-i, k,vi-,k) for some deterministic function ^ that can be evaluated point-wise. Thus Xt is straightfor¬ 
ward to simulate, but the evaluation of its transition density involves an integral of "0 over k and vi-,k, which is 
not analytically available for general functions '0 and random variables {k,vi-,k). 


1.4. Online and exact inference 


As mentioned earlier, the integrals of Section 1.2 are impossible to evaluate exactly for general hidden Markov 

Let us generically denote by It one of these integrals, 


models such as the PZ model described in Section 1.3 


for instance the one in Eq. ([^. Monte Carlo methods have been actively developed for hidden Markov models 
inside and outside the Bayesian framework [nms], and yield a random variable It estimating It- Let us list 


some desirable features of It and of the algorithm producing it, in the context of time series. 

Numerical methods are said to be “exact” if, at any time t they produce consistent estimators of the integral 
It, where consistency is with respect to a tuning parameter N of the algorithm producing the estimator. For 
instance, if It converges to It when N goes to infinity in the L 2 or “mean square” sense, then both the bias 
and the variance of the estimator go to zero, and the algorithm is considered exact. On the contrary the use of 
Extended Kalman filters for non-linear non-Gaussian models results in a systematic bias that cannot be reduced 
to zero, and this bias is typically hard to quantify. More trivially, the estimator that always returns “one” has 
zero variance, a bias that is uniformly bounded over the time index t (if It is so itself); but it does not have an 
algorithmic parameter N allowing a trade-off between computational power and precision. 

Numerical methods for time series are “sequential” if an already obtained estimator can be “updated” upon 
the arrival of a new observation. For instance if an estimator It of It has been obtained at time t, then a 
sequential method yields It+i once yt+i is made available, for a computational cost independent of t. The 
advantage of sequential methods for time series will be illustrated in Section A sequential method is said 
to be “online” only if its performance does not deteriorate with t. Introducing the relative error r{It) of the 
estimator, typically defined as 


r(lt) 


(e [iit-itr]) 


,1/2 


then the method is online if r{It) is uniformly bounded from above over the time index t. This requirement 
rules out the use of standard Monte Garlo algorithms for hidden Markov models. For instance, in the case of 
Eq. (|^, a trivial sequential importance sampling estimator would sample N draws from the “prior” distribution 
TTg(d9)p(dxQ:t I 9) and use the conditional density p{yo-.t \ xo-,t,9) as an importance weight. The resulting 
estimator can be updated for a fixed cost per observation and yields a consistent answer when N goes to infinity. 
However its variance would typically grow exponentially with t, and thus sequential importance sampling is not 
online for this problem. 

In the next section we review existing sequential and exact Monte Carlo methods to approximate the objects 
described in Section 1.2 and we discuss whether or not they satisfy the above online requirement. 
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2. Plug and play methods 

Because of the numerous examples of implicit models as the PZ model of Section [l.3| we focus on numerical 
methods that are compatible with implicit models. These methods, called “plug and play” (or “equation free”) 
in HU, require only the ability to sample the hidden Markov chain, and to evaluate the measurement density. 

2.1. Approximate Bayesian Computation 

Perhaps the most natural plug and play algorithm for implicit models is the ABC (Approximate Bayesian 
Computation) method [HElllHS]. In a nutshell, ABC draws approximately from the posterior distribution of 
(0,Xo:t) given i/o:t using the following steps. 

(1) Draw 0 from the prior distribution TVg. 

(2) Draw xg-.t, a realisation of the hidden Markov chain given 0. 

(3) Draw yg-.t, a realisation of the observations given xg:t and 0. 

(4) If T>{yg,t,yQ.t) < e, retain {0,xg.t), otherwise go back to step (1). 

In this algorithm, V can be understood loosely as a distance defined on the observation space Y . The value 
of e has to be chosen by the user; for smaller values, the synthetic data has to be closer (in the sense of V) 
to the true data in order for (0,xg.t) to be retained. It can be shown that if I? is a true distance on Y then 
when e goes to zero, the procedure samples from the true posterior distribution. When V is not a distance (for 
instance when it is based on summary statitics), or when e is a fixed value, then the samples obtained from 
ABC are not distributed according to the posterior distribution, and it is notoriously difficult to quantify the 
bias between the obtained approximation and the target distribution. Thus ABC estimators are typically not 
exact in the sense of Section [L4l Note that ABC only requires the ability to sample from /i, / and g. 

The next section introduces the particle hlter, which is an exact, online and plug and play method to perform 
filtering and prediction for a given parameter value. On the other hand, particle filters require point-wise 
evaluations of the measurement density g, therefore they are less generally applicable than ABC. 

2.2. Particle filters 

Particle filters have become the preferred methods to deal with filtering, prediction and smoothing tasks 
|351139| , since the publication of the seminal paper m- They were initially introduced to evaluate integrals 
such as Eq. 0 for non-linear, non-Gaussian hidden Markov models. The original algorithm, or “bootstrap 
particle filter”, is described in Algorithm 


Algorithm 1 Particle filter for a given parameter 0. 

1: Draw for each k G {!,..., Xg ~ y{dxg \ 0). 

2: for t = 0 to r do 

3: [weighting] Compute for each k, = g{yt \ x^.O). 

4: [resampling] Sample ancestors ~ r{da}'^^ \ 

ak 

5: [transition] Draw for each k, Xf^i ~ f{dxt+i | x^* ,0). 

6: end for 


The algorithm requires the user to specify a number of “particles” G N. The transition and weighting 
steps propagate the samples from one distribution to the next following standard importance sampling |76j . The 
resampling step consists in selecting the particles according to their weights, so that the particles with lowest 
weights are killed while the particles with highest weights get replicated and propagated. Without the resampling 
step, only one particle would have a significant weight after a few time steps. Various resampling schemes exist 
as described in |39j . A standard resampling scheme consists in drawing each ancestor independently from 
a categorical distribution with parameters wl' so that for each j G {1,... ,Nx}, P(aJ^ = i) = 

Algorithmic the resampling distribution is denoted by r{da^'^^ \ on LinejC 
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The algorithm yields at each step t a weighted sample (aij, approximating the hltering distribution 

p{dxt I yo:t, 0)) in the sense that the integral in Eq. Q can be approximated by the weighted average 


^~^Nx 


Nx 




( 11 ) 


in a consistent manner as goes to infinity. A rich theoretical literature supports particle filters. Some of 
the important results state that the estimator in Eq. 0 satisfies a central limit theorem (CLT), that both 
the bias and the variance are of order l/N^, and that for a finite number N^, the variance of the estimator 
can be uniformly bounded over the time steps t [SSI [331 IMl 113] ■ The CLT and the time uniform results are 
remarkable since they make the particle filter “online” and “exact” in the sense of Section |1.4| Particle filters 
have therefore become standard tools for filtering in hidden Markov models. Kalman filter techniques such as 
Ensemble Kalman Filters are still used when the dimension dx of the state space X is large [l^, because the 
variance of the particle filter estimators typically grows exponentially with dx- Algorithmic improvements and 
theoretical studies of the impact of the dimension on particle filters have been recently proposed in nmsiiiii. 
Exact and online methods for large dimensional filtering problems constitute an active area of research. 

An important by-product of this algorithm is an estimator Zt{9) of the marginal likelihood of 0 at time t, as 
defined in Eq. Q. The estimator takes the simple form 

t / \ 

z.(«)=nuri:"'d. (12) 

s=0 \ k=l / 


and can thus be updated sequentially at each step of Algorithm This estimator has also been extensively 
studied in the literature. It happens to be unbiased, and by scaling the number of particles Nx linearly with the 
number of observations t, its relative variance is bounded by a constant, as proved in [22j . Hence the estimation 
of the likelihood using particle filters is not “online”: for a hxed cost per observation, the relative error goes to 
infinity. The cost of estimating the likelihood p{yo:t \ 9) can be said to be quadratic in the sense that one needs 
to choose Nx = t to guarantee a fixed relative error. Then the cost of running a particle filter with Nx particles 
for t steps is in the number of evaluations of g and draws from /. 

To perform filtering on the path space or smoothing, one can simply modify Algorithm [l] to keep track of 
the generated paths Xq.^ instead of the most recent states x’l, as in Section 3.5 of |33]. Thus one would define 

Xq = Xq on Line of Algorithm and then on Line one would define The resulting 

“path particles” (xq./., Wj approximate the filtering distribution on the full path p(dxo:t \ yo-.t, 9), mentioned 
in Section [L2| Thus, the estimator 


^~^Nx 


Nx 




converges to the integral of p with respect to p{dxo:t \ yo-.t, 9) when Nx goes to infinity and satishes a CLT 
(Chapter 9 of [33]). However when t increases the variance of this estimator quickly deteriorates due to the 
well-known path degeneracy phenomenon. The variance has been shown to increase at least quadratically, and 
in general exponentially, as a function of t in [31 mj. Indeed the resampling steps prune the population of 
distinct path particles at each time step. Let us denote by Xq.j.{s), for 0 < s < t, the s-th component of a 
path Xq.(. Then at a given time t, the latest components x]^.^'^{t) of the path particles are all distinct, 

but the first components (0) contain many replicate values. In fact the number of unique values in 

quickly decreases to only one as t increases. More precisely the number of unique elements among the 
Nx X (t-l-1) components that compose the Nx path particles x^'-N has been upper bounded by {t+l)+CNx log Nx 
in expectation in [56] . where C is independent of t and Nx- To resolve the path degeneracy issue for the 
problem of smoothing given a parameter value 0, many particle algorithms have been proposed, such as fixed-lag 
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approximations, forward filtering backward smoothing or two filter formula, as described in |38L I39j . However 
the path degeneracy phenomenon also has consequences on parameter estimation, as described in the next 
section. 


2.3. Particle-based approaches to parameter estimation 

The early attempts to estimate the parameters using particle methods involve a reparametrization where 
the parameters 9 are treated as an extra component of the hidden states. Thus a new hidden Markov model 
is introduced, where the hidden state is Xt = (xt,9t) for all times t, Xt being the hidden state of the original 
model. The new initial distribution is then TTe{d9o)fJ^{dxo \ Oq), where ttq is the prior on the parameters of the 
original model, and /r the original initial distribution. The new transition is Sg^_j^{d9t)f{dxt \ Xt-i,9t), where 
Sx represents the Dirac measure centered at the point x and / is the transition of original model. Finally the 
new measurement distribution is defined as g{dyt \ Xt, 9t). Then by performing filtering on the modihed model, 
one obtains a particle approximation of the distribution of it given yo-t, and the distribution ng^tidO) of 9 given 
yo:t is obtained as a marginal distribution thereof. 

The idea traces back to (62] , who already recognized the occurrence of path degeneracy. Indeed, the parameter 
values are resampled as part of the states, but contrary to the states they are never diversihed for 

the transition of the parameters is a delta function. Hence there are fewer and fewer unique values in the 
particle approximation of the posterior distribution TTg^tid9). We recognize the similarity with particle hltering 
on the path space, as described in the previous section. Early attempts such as [SU [HS] proposed to replace 
the delta function by a Gaussian random walk in order to introduce diversity among the parameter samples. 
Alternatively, it has been proposed to introduce Markov chain Monte Carlo (MCMC) moves to diversify the 
parameter values [131 [SO]- Those moves have the benefit of leaving the correct posterior distribution invariant. 
However the high dimensionality of {9,xo:t) makes the design of efficient Markov chain Monte Carlo moves 
challenging in the setting of hidden Markov models, as well as the high correlation between the parameters and 
the states and between consecutive states m- Finally (131 |H3] proposed specihc moves in models such that the 
distribution of 9 given Xo-.t and yo-.t only depends on xo-.t through a low dimensional sufficient statistic; see the 
early criticism in [^. Reviews of various parameter estimation methods are proposed in |59[ I60j . 

The inefficiency of standard MCMC moves and the path degeneracy phenomenon have made the various 
attempts at estimating the parameters as part of the hidden states generally unsuccessful. In the recent years 
two major advances have been proposed based on particle filters. The first one is Iterated filtering [MIISS], 
which is an optimization method relying on particle filters and an original representation of the score to find 
the maximum likelihood estimator in implicit models. The second one is particle Markov chain Monte Carlo 
|5] , a class of MCMC algorithms using particle filters to design efficient proposal distributions on the space of 
(9,Xo:t)- Here we recall a particle MCMC method called particle marginal Metropolis-Hastings (PMMH). 

The pseudo-code is given in Algorithm]^ As described in Section ^I?2\ particle filters can be defined on the 
path space and thus yield a sample (xo-g '^t)k=i trajectories approximating the distribution of the paths given 
the observations and the parameter. In the pseudo-code, “drawing a path” means randomly selecting one of 
these paths with probabilities proportional to their weights . Extracting Zt means computing the estimator 
of Eq. (12 1 . The proposal distribution q{d9* \ 9) can be a Gaussian distribution centered on 9. Intuitively, if 
the number of particles was infinite, then drawing a path among the path particles would be equivalent 
to perfect sampling from the filtering distribution p(£ia;o:t | yo-.t, d*), and the likelihood estimator Zt{9*) would 
yield a perfect evaluation of the likelihood p{yo-.t \ 9*). Thus the algorithm would be a standard Metropolis- 
Hastings with proposal distribution q{d9* \ 9)p{dxQ.^ \ yo-.t,9*) and target distribution ■Kg^t{d9)p{dxo:t \ yo-.t,9). 
The remarkable result of |6] is that for any finite N^, the PMMH algorithm also generates a Markov chain with 
invariant distribution tt g^t{d9)p(dxo-.t \ yo-.t, 9). Unsurprisingly larger values of yield better performance of 
the algorithm and convergence properties of particle MCMC methods have been studied in [31I1IZ1I171IM]- In 
terms of computational cost, the number of particles has to be chosen proportional to t in order to control 
the variance of the likelihood estimator. Thus each step of PMMH costs P, as was conjectured in [B] and more 
formally studied in gaiHi]. It is less clear how the number of iterations Ng must be chosen as a function of 
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the number of observations t, although some results obtained for standard MCMC could be informative cni- 
Under the rather optimistic assumption that Ng can be chosen independently of t, then the algorithm would 
overall be of quadratic cost with respect to the number of observations. 


Algorithm 2 Particle marginal Metropolis-Hastings. 


Set some 

Run a particle filter with particles given 
Extract Zt{9^^'>) and draw one path 
for i = 2 to Ne do 

Propose 9* ^ q{d9*\9^’'~^'>). 

Run a particle filter with particles given 9*. 
Extract Zt{9*) and draw one path XQ.f. 

Compute: 

Zt{9*)p{9*) 


a = min 1 




9: Set (6»W, 

10 : end for 


,(«)') _ 




{9* 


with probability a, 

with probability 1 — a. 


"'' 0:4 ) 


Various techniques can be used to process the output of MCMC methods to compute estimators of the model 
evidence as in Eq. 0 mss]. Evidence estimators based on particle MCMC outputs have been proposed in 
[75] , Thus particle MCMC methods constitute the first class of methods providing practical and consistent 
approximations of the objects of interest mentioned in Section 1.2 in the context of general implicit models. By 
design they are iterating over the full dataset yo:t, and thus constitute “offline” or “batch” methods, as opposed 
to the sequential and online features described in Section 1.4 Upon the arrival of a new observation yt+i, the 
algorithm has to be run again from the beginning. The result of a previous run given might only be used 
to design the proposal distribution q{d9* \ 9) and to choose the number of particles N^- 

The SMC^ algorithm has been introduced to address this issue EHEZl, and to take one step towards exact, 
online plug and play methods for Bayesian inference in implicit models. The method processes the observations 
one after the other, and provides at each step an updated estimator of the various quantities of interest such as 
the ones described in Section lOl 


3. A SEQUENTIAL PLUG AND PLAY ALGORITHM 

In the light of particle MCMC methods, which mimick the behaviour of ideal MCMC methods when goes 
to infinity, the idea of SMC^ is to mimick an ideal sequential Monte Carlo (SMC) sampler [53 in the setting 
of hidden Markov models. 

3.1. SMC samplers 

We first describe the SMC sampler algorithm that we would like to imitate in the hidden Markov model 
setting. It has been originally proposed for simpler models where it is possible to evaluate the incremental 
likelihood functions p{yt \ yo-,t-i,9), for all 9 and all t. This is typically the case in parametric models for 
independent observations, where p{yt \ yo-.t-i,9) = p{yt \ 9). Since the full likelihood can be expressed as a 
product of those incremental likelihoods, it can be evaluated point-wise, and thus the standard Metropolis- 
Hastings (MH) algorithm is applicable in this context to sample from the posterior distribution Trg t{d9) for a 
fixed dataset yo-t- SMC samplers approximate each posterior distribution ■Ke^t{d9) sequentially over the time 
t, that is, upon the arrival of new pieces of information. The notion of time can be purely artificial here, e.g. 
when the data correspond to measurements of different individuals. An adaptive SMC sampler is described in 
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Algorithmic It produces a weighted sample {Ot ^^t)k=i approximates the posterior distribution 'KB,t{d6) 
at each time t. 


Algorithm 3 Adaptive SMC sampler. 

1: Draw for each k & {1, ..., Ng} 0q ~ ng{d9). 

2: Set for each k, uj'ti = Ng^. 

3: for t = 0 to T do 
4: if ESS(wt^l^'’) <cthen 

5: Construct a proposal distribution qt-i based on the particles 

6: [resampling] Sample ancestors al'^” ~ r{da^'^'^ \ 

k 

7 : For each k, replace 9^_i by 

8: Set for each k, = Ng^. 

9 : [move] Perform an MCMC move on each particle 9^_-y using qt-i, leaving TTg^t-i{d9) invariant. 

10: end if 

11: Set for each k, 9^ = 9^_i. 

12: [weighting] Update for each k, p{yt \ 

13: Normalize for each fc, 

14: end for 


In the algorithm, the resampling step is similar to the one of Algorithm but it is triggered only when the 
effective sample size (ESS) falls below a threshold c. The ESS is an assessment of the degeneracy of the weights 
and takes values between 0 and I. It is defined as the following function of the normalized weights: 


ESS (. 


UJ 






This adaptive resampling scheme could be applied to the particle filter of Section 2.2 as well, but it proves 
crucial for SMC samplers, for complexity reasons that will become clear in Section 3.3 The combination of the 
resampling and the move steps is called the rejuvenation step. 

A simple choice of move step leaving ng t(d9) invariant is to apply a MH kernel with independent proposals 
from qt{d9). The proposal distribution can be a Gaussian distribution with mean and variance taken as the 

k\Ne 


empirical mean and variance of the particles {dt y‘-^t)k=i E3- Not® that under Bernstein-Von Mises conditions, 
the posterior distribution converges itself to a Gaussian distribution, and thus using an adaptive Gaussian 
proposal distribution in the rejuvenation step is an asymptotically optimal choice. The move step then consists 
in applying one step of MH to each of the Ng particles. 

The algorithmic parameters left to choose are the number of particles Ng and the ESS threshold c, which 
can be set to 50% by default. Higher values mean more rejuvenation steps, which constitute the bulk of 
the computational cost of the algorithm. The algorithm has been extended to a more general form in |36j . 
which allows various algorithmic improvements as well as a unified theoretical study under the Feynman-Kac 
framework. Various articles study its theoretical properties [lilllol l88] . In particular m demonstrates its 
theoretical advantage over MCMC when the posterior distribution is multimodal. The effect of triggering 
resampling steps based on an ESS criterion has been studied in m- The behaviour of the algorithm with 
respect to the dimension dg of the parameter space has been studied in El- 

Similar ly to the likelihood estimator given by particle filters in Eq. (121, SMC samplers yield an estimator 
Zt of the evidence Zt defined in Eq. ([^, that can be computed as 

t / Ne \ 


= n '^^s-lPiVs I yo-.s-l,0s) 


s—0 \k—l 


(13) 
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The form of the estimator is slightly different from the likelihood estimator in Eq. © because the resampling 
steps are not applied at every step. One way to justify it is to consider that if (Ot ^u;t-i)k=i ^ consistent 
particle approximation of Trg^t-i{dO), e.g. in probability, and remembering that the weights are normalized 

in the algorithm, then 

Ng „ 

I yo-.t-i,0^) ^ / p{yt I yo-.t-i,0)TTe,t-i{d9) =p{yt \ yo-.t-i)- 

Taking the product over time steps yields an estimator of the full evidence p{yo:t), given the 
inclusion of SMC samplers into the Feynmac-Kac framework of general particle methods allows 
the properties of this estimator, in particular |36| obtain a central limit theorem. Empirically |94j 
the advantage of SMC samplers over MCMC methods to estimate the model evidence. 

3.2. An approximate SMC sampler for hidden Markov models 

The original SMC sampler as in Algorithm [^cannot be directly applied to the hidden Markov model scenario, 
in the same way that standard MH could not be applied: the likelihood function as defined in Eq. cannot 
be evaluated point-wise, and the incremental likelihoods p{yt \ yo-.t-ij 9) cannot either. Mimicking the reasoning 
behind particle MCMC methods, particle filters can be used to obtain estimators of those likelihood terms. Eor 
simplicity, we present an SMC sampler algorithm to sample from irg^tidd) only, but the same algorithm can be 
used to sample from the joint distribution TTg^t{d9)p{dxo-,t \ yo-.t, 9), as shown in [29]. The article |47| essentially 
proposed the same algorithm independently, with challenging applications in econometrics. 

To each of the Ng parameter values produced by the SMC sampler as in Algorithm we thus attach a 
particle filter with particles as in Algorithm hence the name SMC^ evoking those two layers of particle 
approximations. To avoid confusion we will talk about 0-particles and cc-particles, and denote respectively by 
Ng and N^ their numbers. At any time t, each of the Ng 0-particles is indexed by k as in while each of the 
associated x-particles is indexed by n, k as in a;"’*. The method is described in Algorithm W 

The algorithm follows the structure of Algorithm except that each 0-particle is equipped with a particle 
filter that is updated at each step. The differences are summarised in the following two points. 

• At time t, the weight of the 0-particle 0^ is updated using an estimator p{yt \ obtained from 

the associated particle filter, instead of the true incremental likelihood p{yt \ yo-.t-i, 9^). 

• The move step to rejuvenate the 0-particles relies on particle MCMC instead of MCMC. 

A more complete description of the algorithm is given in [521 . Let us simply mention that SMC^ is a standard 
SMC sampler targeting an extended distribution which admits Trg^tid9)p{dxo-t \ yo-.t, 9) as one of its marginal 
distributions, for any number N^. Hence, the algorithm falls into the class of exact approximations, similarly 
to particle MCMC methods. Thus filtering under parameter uncertainty as defined in Eq. ®> can be addressed 
consistently, for a finite N^ and Ng going to infinity, and furthermore the algorithm is sequential by design. 
Before turning to its computational complexity, let us mention that the model evidence can be retrieved with 
the following estimator 


Thus the algorithm can consistently compute integrals such as Eq. 

3.3. Complexity of SMC samplers 

Going back to the ideal SMC sampler in Algorithm a MH move for each particle 0^ at time t requires an 
evaluation of the likelihood p{yo-.t \ 9^), which costs 0{t). If a rejuvenation step was performed at every step 
from time 0 to t, the algorithm would then cost 0{Ngt^). Fortunately, the ESS decreases slower and slower, 
and thus the rejuvenation step occurs less and less often, hence the cost typically reduces to 0{Ngt), as shown 



Ng 

E 


I yo-.s-i,9^) 


(14) 


model. The 
the study of 
demonstrate 
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Algorithm 4 SMC^ sampler. 

1: [^-initialization] Draw for each k G {1,..., Ng} 9 q ~ TTg{d9). 

2: [^-initialization] For each k, draw for each n G {1,..., ~ didxQ \ 9 q). 

3: Set for each k, uj'ti = Ng^. 

4: for t = 0 to r do 
5: if ESS(wt^l^'’) <cthen 

6: Construct a proposal distribution qt-i based on the 0-particles 

7: [0-resampling] Sample ancestors ^ r(da^'^‘’ ] 

k 

8: For each k, replace 0^_i by 

9: Set for each k, = Ng^. 

10: [0-move] Perform a PMMH move on each 0-particle 9^_i using qt-i, leaving 7re_t_i(d0) invariant. 

11: end if 

12: Set for each fc, 0^ = 0*_^ . 

13: [x-weighting] Compute for each k and n, = g{yt \ x"’^,0^). 

14: [x-resampling] Sample for each k, ^ r{da^''^^ \ 

15: [x-transition] Draw for each k and n, x"_].\ ^ f{dxt+i \ x“* ’’^,9^). 

16: Compute for each k, p{yt \ yo,t-i,9^) = A"! J2n=i 

17: [0-weighting] Update for each k, Wjy = uJi_i p{yt \ yo:t-i,0t)- 

18: Normalize for each k, w* = Wl/J2k=i 

19: end for 


in Theorem 1 of [2^. In other words, at each step t, either the assimilation of yt costs 0{t) or 0(1), whether 
or not a rejuvenation step is performed, which happens with a probability decreasing with t. Let us denote by 
Pt the probability of a rejuvenation step occurring at time t. If the other operations are of cost 1 at each step, 
then the overall cost Ct for the algorithm to reach step t satisfies 

t t 

® ='^{Ps S + {1 - Ps) i) = {t+1) + '^Ps X (s - 1) 

s—0 s—0 

which indeed is linear in t ii pt = 0{l/t). This is another formulation of the result in [^. Thus the algorithm 
is online in the sense described in Section [l.4| Note that the algorithm requires more and more memory, as a 
rejuvenation step at time t involves browsing over the past dataset yo-.t, which thus has to be kept available. 
Thus the algorithm is not “online” memory-wise but only in terms of computational cost. One can hope that 
the errors are uniformly bounded over time for a fixed Ng if the rejuvenation steps are performing equally well 
across all time steps. This motivates the adaptation of the proposal distribution qt in Algorithm]^ 

For the SMC^ algorithm of Algorithm]^ the same reasoning applies, motivated by empirical results such as 
Figure]^ to be described in the next section. The difference is that, in order to bound the errors uniformly 
over time, and for the particle MCMC steps to perform equally well across all time steps, one needs to increase 
the number of x-particles with t. Scaling Nj. linearly with t, the cost of running Ng particle filters with 
Nx x-particles for t steps is 0{Ngt^). Under the same occurrence pattern of rejuvenation steps, [53] obtain 
an overall computational cost in 0{Ngt^). The SMC^ algorithm is thus sequential but not online. Upon the 
arrival of a new piece of observation, the estimator can be updated, but one has to increase the computational 
effort linearly with the number of observations in order to obtain time uniform guarantees. A modification of 
SMC^ is proposed in [5^ so that can be automatically increased along the observations when required. The 
acceptance rate of the rejuvenation steps are monitored to assess whether is large enough at any time t. This 
modification does not make the algorithm online, but allows the automatic adjustment of the computational 
cost to guarantee a stable performance over time. 
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There currently exists no method to perform online and exact Bayesian inference for general hidden Markov 
models [sniiM], which poses a serious challenge in the presence of very long time series. In terms of scaling with 
the number of particles, SMC algorithms are very amenable to modern parallel architectures. The algorithm 
is typically of linear complexity in and in Ng, and most of the computation can be done in parallel, except 
for the resampling steps. This has motivated a series of articles in the recent years, both in the computational 
literature [TBl [551170] and in the methodological literature [23J |571 jOD] . 

4. Numerical illustration 


The PZ model of Section 1.3 is used to illustrate the various outputs of SMC^. Given the parameters set in 


Section 1.3 T = 365 observations are generated as shown on Figure The algorithm is run with Ng = 1024, 
Nx = 1024 and an ESS threshold c of 50%. The proposal distribution qt of the move steps is a Gaussian 
distribution using the empirical mean and covariance of the weighted particles T^t)k=i- Each rejuvenation 
step performs five successive PMMH moves on each particle 9^. The resampling distribution for both the 9- 
particles and the cc-particles is chosen to be the systematic resampling scheme [TT]. To diagnose the behaviour 
of an SMG^ run, the ESS of the 0-particles is plotted against time on Figure As expected the ESS decreases 
slower and slower with the time steps, resulting in only three rejuvenation steps in the second half of the dataset, 
whereas ten rejuvenation steps occurred in the first half. The acceptance rate of the move steps is found to be 
above 40% at the end of the run. 

The computational cost of the algorithm is represented on Figure]^ More precisely what is plotted is the 
number of times that the transition / and the measurement g are called, for each of the 1024 0-particles. Since 
there are 365 time steps and 1024 x-particles per 0-particle, if no PMMH move was performed there would be 
365 X 1024 « 3.7 x 10® transitions per 0-particle. Since five PMMH steps are performed at each rejuvenation 
step, the number of calls per 0-particle reaches 6 x 10®. The dashed line represents a linear regression of the 
number of calls against time, indicating that the number of calls seems to grow linearly in t. Note that this linear 
trend is obtained for a fixed = 1024. The quadratic cost of the overall method mentioned in Section |3.3| 
comes from the fact that one would eventually need to increase if observations kept arriving. Thus if we 
had two years of daily data instead of one, and if we wanted to obtain the same relative error in estimating 
the integrals of Section |1.2[ we would set to 2048 and the overall expected computational time would be 
multiplied by four. Since there are 1024 0-particles, the total number of calls to the functions / and g is in 
the billions. For the PZ model, each transition involves solving numerically a differential equation, here using 
a Runge-Kutta method RK4(3)5[2R-|-] as in [55]. In wall-clock time, this SMC^ run took about 50 minutes on 
a standard desktop computer with 8 cores using the optimized software LibBi [55]. Across runs, the random 
occurrence of rejuvenation steps incurs random computing times. We collected runtimes between 40 and 60 
minutes, using the same algorithmic parameters, over five independent runs. 

The approximation of the posterior distribution TTg j’{d9) at the final time T = 365 is represented by the 
pairwise contour plots of Figure]^ We see that the posterior distribution concentrates in the neighborhood of 
the parameter used to generate the dataset, indicated by black dots on the hgure. We note negative correlations 
between some of the parameters, in particular between mi and m^, which both explain the instantaneous decrease 
of the zooplankton population size, and between ay and Uct, which both account for the stochasticity of the 
model. We observe that the mode of the posterior distribution is not exactly located at the data-generating 
parameter because the inference is conditional upon a finite number of observations. Indeed, one could only 
expect the data-generating parameter to be recovered when the number of observations goes to infinity. Note 
also that uniform priors have been used for all the parameters, therefore the mode of the posterior distribution 
corresponds exactly to the maximum likelihood estimate. 

One advantage of sequential inference is the ability to investigate each intermediate posterior distribution 
'^e,t{d9) for t = 0,... ,T. Figure [^represents this evolution for the first 50 time steps and for each parameter. 
The grey ribbons indicate the 10%, 20%, 30%, 40%, 60%, 70%, 80%, 90% quantiles of each marginal posterior 
distribution. The dashed lines indicate the values used to generate the dataset. We see the posterior distributions 
going nearer the data-generating parameter as more observations are being assimilated. We also observe that 
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Figure 3. Effective Sample Size against time, over one run of SMC^ on the PZ model. The 
vertical dashed lines represent the resampling times and the horizontal dashed line represents 
the ESS threshold, set to 50% of Ng. 



time 


Figure 4. Cumulative cost per 0-particle during one run of SMC^ on the PZ model, with 
Nx = 1024 and five PMMH moves per rejuvenation step. The cost is measured in the number 
of calls to the transition sampling function and the measurement density function. The dashed 
line represents a linear regression of the cost over the time index. 


this concentration occurs at a different rate for each parameter. Indeed, according to asymptotic results on the 
posterior distribution (Chapter 1 of [49]), the asymptotic concentration rates depend on the Fisher information 
matrix of the model. In a non-asymptotic setting, as is the case in practice, we could imagine using plots similar 
to Figure[^to guess how many more observations would be needed to reach a given precision for each parameter. 

The possibility to perform prediction under parameter uncertainty is illustrated on Figure]^ At every time 
step, an 80% predictive region is inferred from the particle approximation of yt+i given i/o:i- The successive 
regions are joined together in a grey ribbon. The actual observations are plotted as circles if they fall in the 
predictive region, and triangles if they fall outside. At the end of the run, 77 observations have landed outside 
the predictive region, which represents 21% instead of the targeted 20%. Since the observations are generated 
from the model, it is expected that asymptotically in t, 20% would fall outside the 80% predictive region. 
Figure]^ is a close-up of Figure]^ focusing on the first 50 time steps. 
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m. 


Figure 5. Posterior distribution of the parameters of the PZ model given the synthetic dataset 
of 365 observations. Contour lines represent the estimated density of the pairwise marginal 
distributions (ctq,, /ia), (cTq,, Uy) and The dots indicate the values used to generate the 

dataset. 


The model evidence estimator of Eq. (14) can be illustrated by introducing another model. We consider a 
simplified model PZ*, which is defined as model PZ except that the quadratic mortality term is removed from 
the differential equation: 


dpt 

=apt- cptZt, 

dzt 

— = ecptZt - miZf 
at 


Thus the parameter is 0 = (/ia, ac^cry^mi) and we use the same uniform prior distributions as for the PZ model. 
We put a uniform prior over the two models and thus the posterior odds as in Eq. (10) reduce to the Bayes 
factor, p{yo-.t \ PZ)/p(yo:t | PZ*). This ratio can be obtained by approximating the evidence using the estimator 
of Eq. ( [T4| for each model. The same algorithmic parameters as described above are used for each model. 

Figure 1^ shows the Bayes factors against time, obtained from five independent runs, and Figure[^is a close- 
up on the first 100 time steps. The bottom horizontal dashed line indicates 1. A Bayes factor of 1 indicates 
no support of the data for one model compared to the other. Values close to zero support model PZ* while 
values larger than one support model PZ. The graph shows that for the first 50 observations, model PZ* seems 
supported by the data. However with more observations, the Bayes factor starts supporting the PZ model, and 
after time 100 each of the five independent runs estimates the factor above 100. The factor keeps increasing 
to extremely large values as more observations are assimilated. Here the dataset is generated using the PZ 
model so the end result does not come as a surprise. The sequential Bayes factor estimation shows Occam’s 
razor principle in action, as mentioned in Section |1.2[ the simpler model is favoured when few observations 
are available. Here about 100 data points are enough to choose the data-generating model with confidence, 
according to the Bayes factor criterion. Note how the five independent evidence estimates diverge from each 
other as observations accrue, showing that the estimator is not stable over time for a fixed number of particles 
Nx and Ng. This confirms that the evidence estimator from SMC^ is not online. 


5. Discussion 

Let us discuss again the objects of inference of Section |1.2| in the light of the reviewed methodology. For a 
given parameter value, filtering and prediction as in the integrals of Eq. 0, 0, 0 can be approximated in an 
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Figure 6. Evolution of the posterior distribution of each parameter of the PZ model 
obtained by SMC^, over the first 50 time steps. The grey ribbons indicate the 
10%, 20%, 30%, 40%, 60%, 70%, 80%, 90% quantiles. The dashed lines indicate the values used 
to generate the dataset. 
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Figure 7. One step predictions obtained using one run of SMC^ on the PZ model. The dark 
ribbon indicates the 80% predictive region of yt+i given j/o:t for each time, under parameter 
uncertainty. Observations that land in the predictive region are indicated by circles, whereas 
observations landing outside are indicated by triangles. 
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Figure 8. Same as Figure but limited to the first 50 time steps. The grey ribbons indicate 
the quantiles of one-step predictive regions. 


online manner using particle filters, as described in Section]^ There are currently no online and exact methods, 
in the sense of Secti on [l.4[ that take into account parameter uncertainty as in Eq. (§, (§, 0. The SMC2 
method proposed in |29] and m, and described in Section is sequential, in the sense that the estimators 
can be updated upon the arrival of new observations. However the incremental cost of the algorithm has to 
grow linearly with t in order to control the relative variance of the estimators. Hence a complete run of the 
algorithm has a quadratic cost in the length of the time series, and thus is not applicable for long time series. 
Informally, for the PZ model used as an illustration in Section]^ the SMC^ algorithm runs in a reasonable time 
on standard hardware, for thousands of observations, but not for millions. The numbers would change with the 
model and the application, but in general online inference under parameter uncertainty is still an open question 
and an active area of research; see Hisa for recent developments. 

One of the difficulties comes from the likelihood estimator of Eq. (121, which requires a quadratic cost in 
the number of observations to guarantee a bounded relative error. On the other hand, the Kalman filter yields 
likelihood evaluations in a linear cost, but only for linear Gaussian models. It is unclear whether an intermediate 
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Figure 9. Bayes factor of the PZ model versus the simplified PZ* model against time. The 
bottom dashed line indicates 1 and the top one indicates 100. Values larger than 1 indicate 
support for the PZ model. The full lines correspond to the estimates of the Bayes factor for 
five independent runs of the SMC^ algorithm. 



Figure 10. Same as Figure]^ but limited to the first 100 time steps. Initially the simpler 
PZ* model is preferred by the Bayes factor, but after 100 observations the more complex, 
data-generating PZ model becomes strongly supported by the criterion. 


setting exists, where the likelihood could be estimated in a super-linear but sub-quadratic cost, at least for some 
models. 

For time series of moderate length, filtering, prediction and parameter inference are still challenging problems 
when the dimension dx of the state space X is large. Indeed the variance of standard particle filter estimators 
typically increases exponentially with dx- Recent developments such as [m [74] could help scaling particle 
methods to larger dimensions, with many potential applications in spatial state space models |24] . Another 
issue specific to large-dimensional state space models is the large com puter memory required to store the 
particles, and especially the paths in the notation of Section 2.2). Large memory usage also involves 

large communication costs on distributed hardware, whenever particles have to be sent from one machine to 
another. The expected memory usage of storing the paths has been studied in [^. Methods to reduce the 
memory and communication costs on distributed hardware have been proposed in [TTIIIEI. 

Another challenge is to adapt computational methods to larger classes of models. The plug and play methods 
described in Section are compatible with parametric models, where the latent process can be simulated and 
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the measurement density can be evaluated point-wise. On top of hidden Markov models, particle filters can be 
implemented for non-Markovian models, as long as these two requirements are met. Inference in non-Markovian 
models using particle methods has been recently considered in [63]. The performance of particle methods in 
non-Markovian settings has been partially studied in |55|. Recent applications of non-Markovian models, for 
instance in probabilistic programming [91] . motivate further research in this direction. 

A number of articles have considered non-parametric hidden Markov models. The authors of m consider 
linear models with a transition equation of the form xt = AtXt-i + GtVt, where the distribution of the noise Vt 
is modelled with a Dirichlet process mixture. A non-parametric model is also considered for the measurement 
noise. Estimation is then performed using Markov chain Monte Carlo or sequential Monte Carlo methods. 
In the more recent literature, |46| consider a transition equation of the form Xt = f{xt-i) + Vt and put a 
Gaussian process prior on the function /; particle Markov chain Monte Carlo methods then enable inference 
under parameter uncertainty. Note that combining non-parametric models for the function / and for the noise 
Vt is not obvious because of identifiability issues. Other instances of non-parametric hidden Markov models 
consider the case where the hidden process lives on an infinite but discrete state space |44l [84] . Particle Markov 
chain Monte Carlo methods have recently been used in this context [55]. The case of finite state space and 
non-parametric specification of the measurement distribution is considered in |92| . Sequential algorithms to 
perform inference in continuous space, non-linear, non-parametric hidden Markov models would constitute an 
interesting addition to the current methodology. 

The author gratefully acknowledges EPSRC for funding this research through grant EP/K009362/1, thanks the organizers 
of the Journees MAS 2014, thanks Arnaud Doucet, Lawrence Murray and Aimee Taylor for useful comments. This article 
is dedicated to the memory of Philip Perry and some long discussions on the Bayesian approach. 
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